Abstract

Human action recognition has been an important topic in computer vision
due to its many applications such as video surveillance, human-machine
interaction and video retrieval. One core problem behind these applications
is automatically recognizing low-level actions and high-level activities of
interest. The former is usually the basis for the latter. This survey gives an
overview of the most recent advances in human action recognition during
the past several years, following a well-formed taxonomy proposed by a
previous survey [1]. From this state-of-the-art survey, researchers can view a
panorama of progress in this area for future research.
1. Introduction

Human action recognition is an active topic in the field of computer vision.
This is due partially to the rapidly increasing amount of video records and
the large number of potential applications based on automatic video analysis,
such as visual surveillance, human-machine interfaces, sports video analysis,
and video retrieval. Among the tasks underlying these applications, one of the
most interesting is human action recognition, especially high-level behavior
recognition.

An action is a sequence of human body movements, and may involve
several body parts concurrently. From the viewpoint of computer vision, the
recognition of action is to match the observation (e.g. video) with previously
defined patterns and then assign it a label, i.e. action type. Depending on
complexity, human activities can be categorized into four levels: gestures,
actions, interactions and group activities [1], and much research follows a
bottom-up construction of human activity recognition. Major components
of such systems include feature extraction, action learning and classification,
and action recognition and segmentation [61]. A simple process consists
of three steps, namely detection of humans and/or their body parts, tracking,
and then recognition using the tracking results. For instance, to recognize
"shaking hands" activities, two persons' arms and hands are first detected
and tracked to generate a spatial-temporal description of their movement.
This description is compared with existing patterns in the training data to
determine the action type.

This paradigm relies heavily on the accuracy of tracking, which is not
reliable in cluttered scenes. Many other methodologies have been proposed, and
they can be classified according to many different criteria, as in existing
survey papers. Poppe [61] discussed human action recognition from image
representation and action classification separately. Weinland et al. [84]
surveyed methods for action representation, segmentation and recognition.
Turaga et al. [77] divided the recognition problem into action and activity
according to its complexity, and classified approaches according to their
ability to handle varying degrees of complexity. There exist many other
classification criteria [1, 11, 9]. Among them, the survey by Aggarwal and
Ryoo [1] is one of the latest comprehensive summaries and comparisons of the
most significant progress in this area. Based on whether the action is
recognized from input images directly, Aggarwal and Ryoo [1] divide the
recognition methodologies into two major categories: single-layered approaches
and hierarchical approaches. Both are further sub-categorized depending on the
feature representation and learning methods, as shown in Fig. 1. That survey
[1] covers progress up to three years ago.

In this paper, we focus on the state-of-the-art research not discussed
in previous surveys. Additionally, to allow comparison with previous
methods, we use a taxonomy similar to that of Aggarwal and Ryoo's survey [1].
For each category in Fig. 1, recent developments are presented together
with a comparison against previously reported methods.

The remainder of this paper is structured as follows. Publicly available
datasets for human action recognition are reviewed in Section 2, followed by
two sections that review recognition approaches. In Section 3, single-layered
recognition approaches are reviewed with different representation and
integration methods. Section 4 discusses the advances in hierarchical
methodologies. Section 5 concludes this survey.

Figure 1: Hierarchical approach-based taxonomy of human activity recognition
methodologies [1].

2. Datasets 

In this section we describe datasets in use since 2009. Datasets that were
used before 2009 are described in more detail in [1]. We focus on newly
collected datasets, and we further analyze and compare them across several
aspects.

2.1. The KTH Dataset 

The current database covers six actions - walking, jogging, running,
boxing, hand waving and hand clapping - performed several times by 25 subjects
in four different scenarios: outdoors, outdoors with scale variation, outdoors
with different clothes, and indoors. It contains a total of 2391 sequences. All
sequences are taken with a static camera at a frame rate of 25 fps and
downsampled to a spatial resolution of 160x120 pixels. In the original paper
[68], sequences were divided into a training set (eight persons), a validation
set (eight persons) and a test set (nine persons). The dataset does not provide
background models or extracted silhouettes.

2.2. The Weizmann Dataset 

The database covers 10 natural actions - running, walking, skipping,
jumping-jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs,
galloping-sideways, waving-two-hands, waving-one-hand and bending - performed
by nine subjects [3]. It contains a total of 93 sequences. All sequences are
taken with a static camera at a frame rate of 25 fps and downsampled to a
spatial resolution of 180x144 pixels. The dataset also has ten additional
sequences of walking captured from different viewpoints, varying between 0 and
81 degrees relative to the image plane. The extracted masks after background
subtraction and the background sequences are provided.

2.3. The IXMAS Dataset 

INRIA Xmas Motion Acquisition Sequences (IXMAS) covers 13 daily-life
actions - checking watch, crossing arms, scratching head, sitting down,
getting up, turning around, walking, waving, punching, kicking, pointing,
picking, overhead throwing and bottom up throwing - performed three times
by 11 subjects [83]. It contains a total of 2145 sequences. All sequences
are filmed with 5 calibrated and synchronized FireWire cameras. The dataset
provides the extracted silhouettes and also reconstructed visual hulls.

2.4. CMU MoBo Dataset

The CMU Motion of Body (MoBo) dataset covers four different actions 
- slow walking, fast walking, inclined walking, and walking with a ball - 
performed by 25 subjects walking on a treadmill in the CMU 3D room [20]. 
More than 8000 images are captured per subject. All sequences are taken 
using six high resolution color cameras. The sequences are 11 seconds long at 
30 fps frame rate with resolution of 640x480 pixels. The extracted silhouettes 
are provided. 

2.5. HOHA-1 (Hollywood Human Actions I) Dataset 

The database contains video samples covering eight actions - answering
phone, getting out of a car, hand shaking, hugging, kissing, sitting down,
sitting up, and standing up - from 32 movies [41]. The two training sets
originate from 12 movies with 219 samples, and the test set originates from
20 other movies not used in training, with 211 samples whose labels were
verified manually.

2.6. HOHA-2 (Hollywood Human Actions II) Dataset

This dataset is an extension of the HOHA dataset. The database contains
video samples covering 12 actions - answering phone, getting out of a car, hand
shaking, hugging, kissing, sitting down, sitting up, standing up, driving car,
eating, fighting, and running - and 10 classes of scenes from 69 movies [49].
The scene classes are leaving house, road, entering bedroom, car, hotel,
kitchen, living room, office, restaurant, and shop. It contains a total of 3669
samples. The training set originates from 33 movies with 823 samples. The
test set originates from 36 other movies not used in training, with 884
samples whose labels were verified manually.

2.7. HumanEva Dataset

The HumanEva-I dataset contains four grayscale video sequences and three
color video sequences that are calibrated and synchronized with 3D body poses
from a motion capture system. The database contains 4 subjects performing
6 actions - walking, jogging, gesturing, catching, boxing and a combination of
walking and jogging [72]. The sequences have a resolution of 640x480
pixels and are captured at 60 Hz.

The HumanEva-II dataset covers an extended sequence combining walking and
jogging actions with two subjects.

2.8. CMU Mocap Dataset 

The CMU Mocap Dataset has six categories - Human Interaction,
Interaction with Environment, Locomotion, Physical Activities & Sports,
Situations & Scenarios, and Test Motions - performed by 144 subjects. These six
categories are subdivided into 23 subcategories. The actions are captured by
12 Vicon infrared MX-40 cameras, which record at 120 Hz with 4-megapixel
images [78].

The above datasets, together with three others - UCF Sports action, UCF
Youtube action and i3DPost Multi-View - are summarized in Table 1.
Table 1: Human Action Dataset

| Dataset | Challenges | Year | Best Accuracy Achieved | Category |
|---|---|---|---|---|
| KTH | Homogeneous backgrounds with a static camera | 2004 | 97.6% [Ziaeefard et al. '10] | General purpose action recognition |
| Weizmann | Partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action, and low quality video | 2005 | 100% [Yang Wang et al. '09; Lin et al. '09; Zeng and Ji et al. '10] | General purpose action recognition |
| IXMAS | Multi-view dataset for view-invariant human actions | 2006 | 89.4% [Xinxiao Wu et al. '11] | Motion acquisition |
| CMU MoBo | Human gait | 2001 | 78.07% [Qinfeng Shi et al. '11] | Motion capture |
| HOHA | Unconstrained videos | 2008 | 56.8% [Andrew Gilbert et al. '11] | Movie |
| HOHA-2 | Comprehensive benchmark for human action recognition | 2009 | 58.3% [Heng Wang et al. '11] | Movie |
| HumanEva | Synchronized video and ground-truth 3D motion | 2009 | 84.3% [Sang Min Yoon et al. '10] | Pose estimation and motion tracking |
| CMU MoCap | 3D marker positions and skeleton movement | 2006 | 100% [Hu et al. '09] | Motion capture |
| UCF Sports | Wide range of scenes and viewpoints | 2008 | 93.5% [Simon Jones et al. '11] | Sports action |
| UCF Youtube | Unconstrained videos | 2008 | 84.2% [Heng Wang et al. '11] | Sports action |
| i3DPost Multi-View | Synchronised/uncompressed-HD 8 view image sequences | 2009 | 80% [Michael B. Holte et al. '11] | Motion acquisition |




3. Single-layered Approaches 

This section reviews the single-layered approaches. These methods recognize
activities directly from the raw video data instead of from primitive
sub-actions or sub-activities. Therefore, most single-layered approaches deal
with simple videos or datasets such as KTH to recognize the actions they
contain. The image sequences from videos are regarded as being generated from
a specific class of actions, and thus such approaches basically involve how to
represent the videos (i.e. extracting features) and match them. As such,
single-layered approaches mainly recognize common actions, and these
recognized simple primitive actions can be employed to recognize more complex
actions using hierarchical combinations, as shown in Section 4.

As shown in a previous survey [1], various approaches have been proposed
for representation and matching in single-layered systems. They can
be broadly categorized into two classes: space-time approaches and sequential
approaches. The core difference between space-time and sequential approaches
is how the temporal dimension (i.e., the third dimension in a 3-D
XYT space) is treated. Space-time approaches treat time as a regular
dimension, like the spatial dimensions, and extract features from the 3-D
volumetric videos, while sequential approaches consider a human activity as
ordered observations along the timeline. Because they take sequential
relationships into consideration, sequential approaches generally achieve
better results than their space-time counterparts.

In this section, we review the most recent progress in this
branch of action recognition, and compare the new methods with each other and
with previously surveyed methods. Space-time approaches are discussed in
Section 3.1, and sequential approaches in Section 3.2.

3.1. Advances in Space-Time Approaches 

For most action recognition systems (also the scope of this survey), the
input is from videos. All videos discussed here consist of a temporal (T)
sequence of 2-D spatial (XY) images, or equivalently a set of pixels in 3-D
XYT space. Therefore, a video can be represented as a spatial-temporal
volume, and this volume contains necessary information for human beings
and machines to recognize the actions and activities in the volume. Based
on this assumption, various representation and correspondence matching
algorithms have been put forward to compactly characterize the underlying
motion patterns.

As shown in Fig. 1, we discuss the progress of space-time approaches using 
the same representation-based taxonomy. Except for methods using the raw 
volume as a feature, all three representations use motion-related information 
to characterize the actions or activities. 

3.1.1. Action Recognition with Space-Time Volumes 

The most intuitive space-time volume approach would use the entire 3-D
volume as a feature or template, and match unknown action videos to existing
ones to obtain the classification. However, this method suffers from noise
and meaningless background information, and therefore some effort has been
made to model the foreground movement.

Based on Bobick and Davis's [6] work on movement, various approaches
have been explored to extend it for action recognition. Hu et al. [27] proposed
to combine both the motion history image (MHI) and appearance information
for better characterization of human actions. Two kinds of appearance-based
features were proposed. The first appearance-based feature is the foreground
image, obtained by background subtraction. The second is the histogram
of oriented gradients (HOG) feature, which characterizes the directions and
magnitudes of edges and corners. SMILE-SVM (simulated annealing multiple
instance learning support vector machines) was proposed for classification.
It aims to obtain a global optimum via simulated annealing, without
relying on model initialization to avoid local minima.
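To make the motion history image concrete, the following is a minimal sketch, assuming grayscale frames stacked in a NumPy array and plain frame differencing as the motion detector. The function names and the parameters `tau` and `diff_threshold` are illustrative assumptions, not taken from [6] or [27].

```python
import numpy as np

def motion_history_image(frames, tau=10, diff_threshold=25):
    """Motion history image for grayscale frames of shape (T, H, W).

    Assumption: motion is detected by thresholding the absolute
    difference between consecutive frames.
    """
    frames = np.asarray(frames, dtype=np.float64)
    mhi = np.zeros(frames.shape[1:], dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr - prev) > diff_threshold
        # Moving pixels are reset to tau; all others decay by one, floored at 0.
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

def motion_energy_image(mhi):
    """Binary motion energy image: where any recent motion was recorded."""
    return mhi > 0
```

Brighter MHI values mark more recent motion, so a single image encodes both where and roughly when movement occurred within the last `tau` frames.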

Qian et al. [62] combined global features and local features to classify and
recognize human activities. The global feature was based on the binary motion
energy image (MEI); contour coding of the MEI was used instead of the MEI
itself as a better global feature, because it overcomes the limitation of the
MEI that hollows appear where parts of the human blob are undetected.
For local features, an object's bounding box was used. The feature points
were classified using multi-class support vector machines. Roh et al. [64] also
extended Bobick and Davis's [6] MHI from 2-D to 3-D space, and proposed a
volume motion template for view-independent human action recognition using
stereo videos.

Similarly, motivated by a gait energy image [23], Kim et al. [38] proposed
an accumulated motion image (AMI) to represent spatiotemporal features of
occurring actions. The AMI was the average of image differences. A rank
matrix was obtained using ordinal measurement of AMI pixels. The distance
between rank matrices of query video and candidate video was computed
using L1-norms, and the best match, spatially and temporally, was the
candidate with the minimum distance.
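The AMI pipeline described above can be sketched as follows. This is an illustrative sketch only: it assumes the AMI is the mean absolute difference of consecutive grayscale frames and ranks individual pixels directly, and the function names are hypothetical rather than taken from [38].

```python
import numpy as np

def accumulated_motion_image(frames):
    """Average of absolute differences between consecutive frames (T, H, W)."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).mean(axis=0)

def rank_matrix(image):
    """Replace each pixel by its rank (0 = smallest) among all pixels."""
    flat = np.asarray(image, dtype=np.float64).ravel()
    order = np.argsort(flat, kind="stable")
    ranks = np.empty(flat.size, dtype=np.int64)
    ranks[order] = np.arange(flat.size)
    return ranks.reshape(np.shape(image))

def rank_distance(ami_a, ami_b):
    """L1 distance between the rank matrices of two AMIs of equal shape."""
    return int(np.abs(rank_matrix(ami_a) - rank_matrix(ami_b)).sum())
```

Because only the ordering of pixel values is compared, the distance is unchanged by any monotonically increasing rescaling of the AMI, which is one reason ordinal measures are attractive for matching.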

Various researchers have tried to incorporate person models such as
silhouettes or skeletons for action recognition. Ikizler and Duygulu [29]
proposed a new pose descriptor called histogram of oriented rectangles (HOR)
for action recognition. They represented each human pose in an action sequence
with oriented rectangular patches extracted over the human silhouette, which
then formed spatial oriented histograms to represent the distribution of these
rectangular patches. The local dynamics were captured by summing the HOR
within a sliding window. Four matching methods were used for classification,
namely nearest neighbor, global histogramming, SVM and dynamic time warping.

Fang et al. [15] first mapped the high-dimensional silhouettes to
low-dimensional points as a spatial motion description, using locality
preserving projection. This low-dimensional motion vector was assumed to
describe the intrinsic motion structure. Then three different kinds of
temporal information, i.e. temporal neighbor, motion difference and motion
trajectory, were applied to the spatial descriptors to obtain the feature
vectors, which were fed to a k-nearest neighbor classifier.

Ziaeefard and Ebrahimnezhad [94] proposed the cumulative skeletonized
image (CSI) across time as features, and constructed 2-D angular/distance
histograms based on it. A hierarchical SVM was used for the matching process.
First, an SVM classifier produced a coarse classification of the CSI
histograms that separated dissimilar actions, and then a second SVM was
applied to the confused actions using salient features among similar actions.

Wang and Mori [82] proposed semilatent topic models (STM) following
the bag-of-words framework, where a "word" corresponds to a frame and a
"document" corresponds to a "video sequence". After obtaining stabilized
persons in a video sequence, optical flow was computed and half-wave
rectified into four channels, followed by filtering to form the motion
descriptor, based on which a codebook was constructed. Building on latent
topic models such as LDA [5] and CTM [4], STM does not require choosing the
number of latent topics, yet it gave better training efficiency and
recognition accuracy.
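The half-wave rectification step mentioned above can be illustrated with a short sketch, assuming the horizontal and vertical optical flow components are already available as arrays. The function name is hypothetical and the subsequent filtering step is omitted.

```python
import numpy as np

def half_wave_rectify(flow_x, flow_y):
    """Split a flow field into four non-negative channels.

    Returns an array of shape (4, H, W) ordered as
    (x positive, x negative, y positive, y negative).
    """
    fx = np.asarray(flow_x, dtype=np.float64)
    fy = np.asarray(flow_y, dtype=np.float64)
    return np.stack([
        np.maximum(fx, 0.0),   # rightward motion
        np.maximum(-fx, 0.0),  # leftward motion
        np.maximum(fy, 0.0),   # positive vertical component
        np.maximum(-fy, 0.0),  # negative vertical component
    ])
```

Keeping the signs in separate channels means that later smoothing does not cancel opposite motions against each other, and the original flow can still be recovered as the difference of each channel pair.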

Guo [21] viewed an action as a temporal sequence of local shape deformations
of centroid-centered object silhouettes. Each action was represented by the
empirical covariance matrix of a set of 13-dimensional normalized geometric
feature vectors that captured the shape of the silhouette tunnel. The
similarity of two actions was measured in terms of a Riemannian metric between
their covariance matrices. The silhouette tunnel of a test video was broken
into short overlapping segments, and each segment was classified using a
dictionary of labeled action covariance matrices and the nearest neighbor rule.
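As an illustration of covariance matching, the sketch below builds a covariance descriptor and compares two descriptors with the affine-invariant Riemannian distance, followed by a nearest neighbor rule. The choice of this particular metric, the small regularization term `eps`, and all function names are assumptions made for the example; the text above only states that a Riemannian metric was used.

```python
import numpy as np

def covariance_descriptor(features, eps=1e-6):
    """Empirical covariance of feature vectors given as rows (n_samples, d).

    eps * I is added (an assumption) so the matrix stays positive definite.
    """
    f = np.asarray(features, dtype=np.float64)
    cov = np.atleast_2d(np.cov(f, rowvar=False))
    return cov + eps * np.eye(cov.shape[0])

def riemannian_distance(a, b):
    """Affine-invariant distance: sqrt(sum_i log^2(lambda_i)),
    where lambda_i are the eigenvalues of a^{-1} b."""
    eigvals = np.linalg.eigvals(np.linalg.solve(a, b)).real
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))

def nearest_label(query_cov, dictionary):
    """Nearest neighbor rule over a list of (label, covariance) pairs."""
    label, _ = min(dictionary, key=lambda item: riemannian_distance(query_cov, item[1]))
    return label
```

A test segment would be labeled by calling `nearest_label` with its covariance matrix and a dictionary of labeled training covariances.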



Figure 2: An example of computing the shape-motion descriptor of a gesture frame with 
a dynamic background from Lin et al. [43] (©2009 IEEE), (a) Raw optical flow field, (b) 
Compensated optical flow field, (c) Combined, part-based appearance likelihood map, (d) 
Motion descriptor Dm computed from the raw optical flow field, (e) Motion descriptor Dm 
computed from the compensated optical flow field, (f) Shape descriptor Dg. 

Efforts in other directions have also occurred. Kim and Cipolla [37]
extended Canonical Correlation Analysis (CCA) to measure video-to-video
similarity. The method acted upon video volumes, avoiding the difficult
problems of explicit motion estimation, and provided a way of spatiotemporal
matching that is robust to intraclass variations of action due to CCA. Liu et
al. [44] applied principal component analysis (PCA) to a salient action unit
(SAU) (i.e., one cycle of repetitive action in a video), and an AdaBoost
classifier was used to classify the action in a query video. Cao et al. [10]
provided a new way to combine different features using a heterogeneous feature
machine (HFM).

3.1.2. Action Recognition with Space-Time Trajectories 

Trajectory-based approaches are based on the observation that the tracking
of joint positions is sufficient for humans to recognize actions [33].
Trajectories are usually constructed by tracking joint points or other interest
points on the human body. Various representations and corresponding
algorithms match the trajectories for action recognition.
Messing et al. [51] extracted feature trajectories by tracking Harris3D
interest points using a KLT tracker [46], and the trajectories were represented
as sequences of log-polar quantized velocities. They used a generative mixture
model to learn a velocity-history language and classify video sequences. A
weighted mixture of bags of augmented trajectory sequences was modeled
for action classes. These mixture components can be thought of as velocity
history words, with each velocity history feature being generated by one
mixture component, and each activity class has a distribution over these
mixture components. Further, they showed how the velocity history feature
can be extended, both with a more sophisticated latent velocity model, and
by combining the velocity history feature with other useful information, like
appearance, position, and high level semantic information.
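A log-polar quantization of trajectory velocities might look like the sketch below. The number of angular bins, the power-of-two magnitude edges and the function name are assumptions chosen for illustration; they are not the settings reported in [51].

```python
import numpy as np

def quantize_velocities(track, n_angle_bins=8, mag_edges=(0.5, 1.0, 2.0, 4.0)):
    """Turn a point track of shape (T, 2) into a sequence of discrete symbols.

    Each frame-to-frame velocity is binned by direction (uniform angular
    bins) and by magnitude (logarithmically spaced edges, an assumption).
    """
    track = np.asarray(track, dtype=np.float64)
    velocity = np.diff(track, axis=0)
    magnitude = np.hypot(velocity[:, 0], velocity[:, 1])
    angle = np.arctan2(velocity[:, 1], velocity[:, 0]) % (2 * np.pi)
    bin_width = 2 * np.pi / n_angle_bins
    angle_bin = np.floor(angle / bin_width).astype(int) % n_angle_bins
    mag_bin = np.digitize(magnitude, mag_edges)
    # One symbol per (magnitude bin, angle bin) pair.
    return mag_bin * n_angle_bins + angle_bin
```

The resulting symbol sequence plays the role of the observed "velocity history" that a generative model can then be trained on.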

Wang et al. [80] proposed an approach to describe videos by dense
trajectories. They sampled dense points from each frame and tracked them based
on displacement information from a dense optical flow field. Local descriptors
of HOG, HOF and MBH (motion boundary histogram) around interest
points were computed. This is shown in Fig. 3.



Figure 3: Illustration of dense trajectory description from [80] (©2011 IEEE). Left: Feature
points are sampled densely for multiple spatial scales. Middle: Tracking is performed in the
corresponding spatial scale over L frames. Right: Trajectory descriptors of HOG, HOF
and MBH.


3.1.3. Action Recognition with Space-Time Local Features 

The application of local features in action recognition was extended from
object recognition in images. Local features describe points and their
surroundings in the 3-D volumetric data that have unique discriminative
characteristics. These points and the corresponding local feature descriptors
carry the most information and are comparatively robust. In terms of the
density of extracted feature points, local feature approaches
can be divided into two broad categories: sparse and dense. The Harris3D
detector [40] and the Dollar detector [14] are representative of the former,
and optical flow-based methods of the latter. Most algorithms are derived from
them. Other novel methods have also been applied for finding interest points
to recognize actions.

Bregonzio et al. [8] proposed clouds of space-time interest points to
overcome the limitations of the Dollar detector [14]. Using the interest
points detected by [14], this was achieved by extracting holistic features
from clouds of interest points accumulated over multiple temporal scales,
followed by automatic feature selection. SVMs and Nearest Neighbor Classifiers
(NNCs) were employed for classification. One example of clouds of interest
points is shown in Fig. 4. Jones et al. [34] also based their research on
the Dollar detector [14] to detect and describe interest points, which were
then clustered using k-means. Their innovation is the incorporation of a
relevance feedback mechanism using ABRS-SVM (i.e., asymmetric bagging
and random subspace support vector machine).
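Clustering interest point descriptors with k-means and describing a video by its histogram of cluster assignments is the usual bag-of-words construction. The sketch below is a generic illustration with hypothetical function names and a plain Lloyd-style k-means; it does not reproduce the descriptors or the relevance feedback of [34].

```python
import numpy as np

def nearest_word(descriptors, codebook):
    """Index of the closest codeword for each descriptor (squared Euclidean)."""
    d = np.asarray(descriptors, dtype=np.float64)
    c = np.asarray(codebook, dtype=np.float64)
    dist2 = ((d[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)

def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means; initial centers are k distinct random points."""
    points = np.asarray(points, dtype=np.float64)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_word(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(descriptors, codebook):
    """Normalized histogram of codeword occurrences for one video."""
    words = nearest_word(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()
```

In practice the codebook would be learned once from training descriptors, and each video's histogram would then be passed to a classifier such as an SVM.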

In [75], space-time interest points are detected with the Harris3D
detector [40], and each is assigned a label in {-1, 1} indicating whether it
belongs to the action class of interest, using a Bayesian classifier. The
feature vectors of interest point descriptors and labels are then provided to
a PCA-SVM classifier to recognize the action type. In this work, the action is
also localized based on CRF weighting results.

While 3D Harris corners [40] are widely used, they suffer from the problem of
sparsity. Gilbert et al. [18] used dense simple 2D Harris corners [25] at
multiple scales to construct features. A two-stage hierarchical grouping
process was used to classify features and the actions. Sadek et al. [67] also
used a Harris corner detector in each frame and described the local feature
points with temporal self-similarities defined on fuzzy log-polar histograms.
Together with global features (i.e., change of gravity centers), the feature
vectors were classified with an SVM.

Optical flow is also commonly used for feature point detection and
description [30, 26, 58]. Ikizler-Cinbis and Sclaroff [30] employed optical
flow and foreground flow to extract motion features for persons, objects and
scenes, based on which the shape feature for each was also extracted. All of
these feature channels were inputs to a multiple instance learning (MIL)
framework to find the location of interest in a given video.

Holte et al. [26] constructed 3D optical flow from eight weighted 2D flow
fields to achieve view-invariant action recognition. 3D Motion Context (3D-MC)
and Harmonic Motion Context (HMC) were used to represent the extracted
3D motion vector fields efficiently and in a view-invariant manner. The
resulting 3D-MC and HMC descriptors were classified into a set of human
actions using normalized correlation, taking into account the performing speed
variations of different actors.

Another optical flow-based work was Oikonomopoulos's B-spline polynomial
descriptor [58]. It was extracted at spatiotemporal salient points detected
on the estimated optical flow field for a given image sequence and was
based on geometrical properties of three-dimensional piecewise polynomials,
namely B-splines. The latter were fitted on the spatiotemporal locations of
salient points that fell within a given spatiotemporal neighborhood. The
descriptor is invariant to translation and scaling in space-time.



Figure 4: Examples of clouds of interest points. The clouds at different temporal scales 
are highlighted in yellow boxes. [8] (©2009 IEEE) 

Many efforts have been made to find interest points with other
principles [63, 52, 90, 69, 47, 93, 42]. For example, Rapantzikos et al. [63]
proposed a saliency-based interest point detector which incorporates
intensity, color and motion. It used a multi-scale volumetric representation
of the video and involved spatiotemporal operations at the voxel level.
Interest points were selected as the extrema of the saliency response.
Different recognition algorithms were used, such as bag-of-words with nearest
neighbor for the KTH dataset and an SVM kernel for the HOHA dataset.
Minhas et al. [52] proposed new methods to compute the spatiotemporal
features using the 3D dual-tree discrete wavelet transform (DT-DWT). The 3D
DT-DWT was employed to get the spatiotemporal information (subband vector
of wavelet coefficients) efficiently, and an affine SIFT was used for local
static features. By using hybrid spatiotemporal and local static features, the
extreme learning machine (ELM) classifier reached high accuracy on public
datasets.

Yu et al. [90] introduced a framework based on semantic texton forests
(STFs) to achieve real-time action recognition. The FAST detector [65]
was extended to V-FAST for video interest point detection. STFs were
applied to classify local space-time volumes around interest points to
generate the discriminative codebook. Pyramidal spatiotemporal relationship
match (PSRM) was used for local appearance and structural information. A set
of 3D relationship histograms was constructed by analyzing every pair of
feature points using PSRM.

Zhu et al. [93] proposed a new TISR (temporally integrated spatial
response) descriptor, which captured the characteristics of individual actions
by extracting dense spatiotemporal descriptors and representing actions by
bag-of-words features. With a visual vocabulary of the TISR descriptors, the
bag-of-words histogram features were able to tolerate spatial and temporal
variations.

Le et al. [42] presented an extension of the independent subspace analysis
(ISA) algorithm to learn invariant spatiotemporal features from unlabeled
video data in a hierarchical way. More specifically, features were first learnt
with small input patches flattened into a vector, convolved with a larger
region of the input data, and then used as input to the layer above. The
features from both layers were combined as local features for classification.
This two-layered stacked convolutional ISA model overcomes the limitation
of ISA for large inputs, and performed well on challenging datasets.


3.2. Sequential Approaches 

Single-layered sequential approaches differ from space-time approaches in
that they are designed to capture temporal relationships of observations.
Thus, human actions are represented as a sequence of observations. Generally,
an observation is associated with local or global features extracted from a
frame or a set of frames. As in [1], exemplar-based recognition and state
model-based analysis are the two sub-categories of sequential approaches.
Table 2: Comparison of space-time approaches

| Approach | Category | KTH | WZMN | Other |
|---|---|---|---|---|
| Hu'09 | Volume | - | - | CMU: 100% |
| Ikizler'09 | Volume | 90% | 100% | - |
| Wang'09 | Volume | 91.2% | 100% | - |
| Guo'09 | Volume | 95.33% | - | - |
| Kim'09 | Volume | 95.33% | - | Gesture: 82% |
| Cao'09 | Volume | - | - | CMU: 88.1% |
| Liu'10 | Volume | 81.5% | 98.3% | - |
| Ziaeefard'10 | Volume | 97.6% | - | - |
| Fang'10 | Volume | - | 90.21% | - |
| Qian'10 | Volume | 88.69% | - | - |
| Kim'10 | Volume | 96.4% | - | - |
| Messing'09 | Trajectory | 89% | - | DailyAction: 67% |
| Wang'11 | Trajectory | 94.2% | - | HOHA2: 58.3%; UCF: 88.2% |
| Bregonzio'09 | Local | 93.17% | 96.66% | - |
| Rapantzikos'09 | Local | 88.3% | - | - |
| Minhas'10 | Local | 94.83% | 99.44% | - |
| Thi'10 | Local | 93.83% | 98.2% | HOHA: 26.63%; TRECVid: 23.25% |
| Ikizler-Cinbis'10 | Local | - | - | Youtube: 72.51% |
| Yu'10 | Local | 95.67% | - | UT-Interaction: 83.33% |
| Le'11 | Local | 93.9% | - | UCF: 86.5%; HOHA2: 53.3%; Youtube: 75.8% |
| Jones'12 | Local | 93.2% | - | UCF: 93.5%; HOHA: 48.4% |
| Sadek'11 | Local | 93.6% | 97.8% | - |
| Gilbert'09 | Local | 94.5% | - | HOHA: 31.4%; mKTH: 68.8% |
| Oikonomopoulos | Local | 81% | 92% | Aerobics: 95% |
| Lui'11 | Local | 97% | - | UCF: 88% |
3.2.1. Exemplar-based Approaches

As mentioned earlier, sequential approaches define an action as a
sequence of observations, and how the observations are extracted is not
limited. Exemplar-based approaches represent human actions with a template
sequence of observations or a set of sample sequences of action observations.
Thus the focus of exemplar-based approaches is defining how a new input
video can be compared with the template or sample sequences of action
observations. In previous work, dynamic time warping (DTW) has been widely
adopted for exemplar-based human action recognition [13, 16, 79]. In [85], the
similarity between the input and an action template is measured by comparing
coefficients of the activity basis after principal component analysis (PCA).
Dynamic feature changes have also been utilized to represent an activity as a
linear time-invariant (LTI) system [45].

Recently, Lin et al. [43] represented actions in videos as sequences of
prototypes. Each prototype is based on a novel shape-motion feature, and
the sequence is generated by matching against a hierarchical prototype tree
constructed by iteratively applying K-means (K=2) clustering. Given an
action video, a prototype sequence is estimated for it. Prototype sequences
are matched using the FastDTW algorithm to increase computational
efficiency.
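
The idea of a prototype tree built by repeated two-way K-means splits can be sketched as follows. This is a minimal sketch under stated assumptions, not the construction used in [43]: the features are hypothetical low-dimensional tuples, the seeding rule is chosen only to make the example deterministic, and all function names are invented for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of feature tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def two_means(points, iters=20):
    """Split points into two clusters with Lloyd's algorithm (K=2)."""
    # Deterministic seeding (an assumption): first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break  # degenerate split; the caller keeps the node as a leaf
        new0, new1 = mean(left), mean(right)
        if (new0, new1) == (c0, c1):
            break  # centres stopped moving
        c0, c1 = new0, new1
    return left, right


def build_tree(points, min_size=2):
    """Recursively bisect the features; each node stores its centroid as a prototype."""
    node = {"prototype": mean(points), "children": []}
    if len(points) > min_size:
        left, right = two_means(points)
        if left and right:
            node["children"] = [build_tree(left, min_size), build_tree(right, min_size)]
    return node


def nearest_leaf(tree, x):
    """Descend the tree, following the child whose prototype is closest to x."""
    node = tree
    while node["children"]:
        node = min(node["children"], key=lambda c: sq_dist(c["prototype"], x))
    return node["prototype"]


def prototype_sequence(tree, frames):
    """Map each per-frame feature to its nearest leaf prototype."""
    return [nearest_leaf(tree, x) for x in frames]
```

Descending the tree compares a frame with two prototypes per level instead of with every leaf, which is the usual motivation for a hierarchical organisation of prototypes.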

3.2.2. state model-based approaches 

Instead of representing a human action as a sequence of observations,
state model-based approaches learn a state model for each action, and each
action is represented in terms of a set of hidden states. The model
generates observation sequences, and every observation sequence is
associated with an instance of the corresponding action. Standard hidden
Markov models (HMMs) have been widely used in state model-based approaches
[86, 74, 7]. HMMs have also been extended to coupled hidden semi-Markov
models (CHSMMs) to model the duration of human activities [48, 56].
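
Recognition with one state model per action can be sketched as follows: the forward algorithm evaluates the likelihood of an observation sequence under each HMM, and the action with the highest likelihood is reported. The discrete observation symbols, the two toy models, and the names `forward_likelihood` and `recognize` are assumptions for illustration and are not taken from the cited works.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM using the forward algorithm.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of emitting symbol o in state s
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Pick the action whose HMM assigns the observation sequence the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Two hypothetical two-state models over the symbols {0, 1}.
emit = [[0.9, 0.1], [0.1, 0.9]]
models = {
    # Left-to-right model: tends to emit 0s first, then 1s.
    "walk": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], emit),
    # Alternating model: tends to switch state at every step.
    "wave": ([0.5, 0.5], [[0.1, 0.9], [0.9, 0.1]], emit),
}
```

For long sequences the products underflow, so a practical implementation would rescale `alpha` at each step or work in log space.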

HMMs and their extensions are still applied in human action recognition.
In [89], a flexible star skeleton is described for posture representation;
the aim is to accurately match human extremities using contours and
histograms from an image frame, and an HMM is then used to recognize human
actions. In [36], novel texture descriptors are proposed to describe
motion, and an HMM models the temporal development of the texture motion
histograms. In [70], a discriminative semi-Markov model is proposed; to
efficiently solve the inference problem of simultaneously segmenting and
recognizing different actions, the authors designed a Viterbi-like dynamic
programming algorithm. A comparison of sequential approaches is given in
Table 3.

Table 3: Comparison of sequential approaches

Approach        Category      KTH      WZMN   Other
Shi’11          State-based   95%      -      CMU:78%, WBD:94%
Yu’09           State-based   -        -      HumanClimbingFences:97.9%, BalletMovie:93.6%
Kellokumpu’09   State-based   93.8%    98%    -
Lin’09          Exemplar      95.77%   100%   -
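
The joint segmentation and recognition problem mentioned above can be illustrated with a generic semi-Markov dynamic program: the best labelled segmentation of the first t frames is built from the best segmentation of a shorter prefix plus one new scored segment. This is a minimal sketch, not the algorithm of [70]; the segment scoring function, the toy per-frame evidence, and all names are hypothetical.

```python
def segment_and_label(n_frames, labels, seg_score, max_len):
    """Jointly segment frames [0, n_frames) and label every segment.

    seg_score(label, start, end) scores frames [start, end) as one segment of `label`.
    best[t] holds the highest total score of any labelled segmentation of frames [0, t).
    """
    best = [float("-inf")] * (n_frames + 1)
    back = [None] * (n_frames + 1)
    best[0] = 0.0
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                score = best[start] + seg_score(label, start, end)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)
    # Follow the back-pointers from the last frame to recover the segments.
    segments = []
    t = n_frames
    while t > 0:
        start, label = back[t]
        segments.append((start, t, label))
        t = start
    segments.reverse()
    return best[n_frames], segments


def frame_vote_score(evidence, penalty=0.5):
    """Toy score: +1 per frame agreeing with the label, -1 otherwise, minus a per-segment penalty."""
    def seg_score(label, start, end):
        hits = sum(1 for e in evidence[start:end] if e == label)
        return hits - (end - start - hits) - penalty
    return seg_score


# Hypothetical per-frame evidence for two actions "a" and "b".
evidence = ["a", "a", "a", "b", "b"]
score, segments = segment_and_label(5, ["a", "b"], frame_vote_score(evidence), max_len=5)
```

The per-segment penalty discourages over-segmentation; with `max_len` bounded, the run time grows linearly in the number of frames.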


4. Hierarchical Approaches 

As described in [1], hierarchical approaches recognize events of
interest (high-level activities) based on simpler, low-level
sub-activities. In other words, a high-level activity can be decomposed
into a sequence of sub-activities; for example, "hand shaking" may be
represented as a sequence of two hands being extended, merging into one
object, and being withdrawn. Sub-activities can in turn be treated as
high-level activities and decomposed further until only atomic ones remain.

The advantage of hierarchical approaches is their capability to model
the complex structure of human activities and their flexibility in handling
individual activities, interactions between humans and/or objects, and
group activities. Moreover, hierarchical models provide an intuitive and
convenient interface for integrating prior knowledge about the structure of
activities. Hierarchical approaches are, to some extent, closely related to
single-layer approaches. For example, non-hierarchical single-layer
approaches can readily be used for low-level or atomic action recognition
such as gesture detection. Some non-hierarchical single-layer approaches
can also be extended to hierarchical models, such as multi-layered HMMs.

Using the taxonomy proposed in [1], hierarchical approaches are
categorized into three groups: statistical approaches, syntactic
approaches, and description-based approaches.

4.1. statistical approaches

HMMs can be considered a simple case of dynamic Bayesian networks
(DBNs). An HMM represents the state of the world using a single discrete
random variable, whereas a DBN represents it using a set of random
variables. Multiple levels of hidden states form a representation of
hierarchical human activities. Previous research on statistical approaches
mainly concerns applications of extended HMMs and DBNs: two-layered
hierarchical hidden Markov models [59, 92, 88] and dynamic probabilistic
networks (DPNs), also known as DBNs [19, 12]. Sub-activities can be either
concurrent or sequential, but HMM-based approaches in the literature handle
sequential sub-activities only. Thus, a hierarchical approach using a
propagation network (P-net) [71] has been proposed to handle both
concurrent and sequential sub-activities. Beyond HMMs and DBNs, a new
four-layered hierarchical probabilistic latent model is proposed in [87].
First, spatial-temporal features are detected and clustered using a
hierarchical Bayesian model to form atomic actions. Then, based on LDA, a
hierarchical probabilistic latent model is used to recognize the action
without the need to specify the number of latent states. Local
spatial-temporal features are used instead of global features such as
human gesture. The work is an attempt to use clustered space-time features
as atomic actions and as the basis for hierarchical descriptions and
representations of complex actions.

Another statistical approach [24] decomposes the body into a
hierarchical structure. A hierarchical manifold space is learned to
describe the motion patterns, cascade conditional random fields (CRFs) are
used to predict these motion patterns, and SVMs classify the final human
actions based on them. A hierarchical representation of human action is
thus proposed in place of a simple non-hierarchical bag-of-words
representation. In [50], a hierarchical K-means tree is also used to
represent the feature cues.

The problem of insufficient training data is handled in [91] by
integrating domain knowledge. Domain knowledge expressed in first-order
logic is exploited to learn both the structure and the parameters of a
dynamic Bayesian network.

4.2. syntactic approaches

Syntactic approaches model actions as strings of symbols. A symbol in
this context is an atomic sub-activity as described in the previous
section. Atomic sub-activities can be recognized using any of the previous
hierarchical or non-hierarchical techniques. However, representing actions
as strings of symbols limits the recognition of concurrent actions. In
previous work, context-free grammars (CFGs) have been studied and applied
to human action recognition. Several probabilistic extensions of CFGs,
stochastic context-free grammars (SCFGs), are introduced in
[32, 54, 53, 35]. Generally, two-layer frameworks are proposed: the lower
layer recognizes atomic or low-level actions, and the higher layer uses
parsing techniques for high-level activity recognition. Another limitation
is that the user must provide a set of production rules; to overcome it,
Kitani et al. [39] introduced an algorithm that automatically learns rules
from observations.
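
The higher, parsing layer of such a two-layer framework can be sketched with a probabilistic CYK parser over a stochastic grammar in Chomsky normal form. The toy "handshake" grammar, its rule probabilities, and the function name are assumptions for illustration only; the parsers used in [32, 54, 53, 35] differ in detail.

```python
from collections import defaultdict


def best_parse_prob(symbols, lexical, binary, start):
    """Return the probability of the most likely parse of `symbols` from `start`.

    lexical maps (nonterminal, terminal) to a probability.
    binary maps (nonterminal, left_child, right_child) to a probability.
    A result of 0.0 means the string is not generated by the grammar.
    """
    n = len(symbols)
    # chart[i][j][A] = best probability that A derives symbols[i:j]
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, sym in enumerate(symbols):
        for (lhs, word), p in lexical.items():
            if word == sym:
                chart[i][i + 1][lhs] = max(chart[i][i + 1][lhs], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (lhs, left, right), p in binary.items():
                    cand = p * chart[i][k][left] * chart[k][j][right]
                    if cand > chart[i][j][lhs]:
                        chart[i][j][lhs] = cand
    return chart[0][n][start]


# Hypothetical grammar: the terminals stand for atomic sub-activities
# reported by a lower recognition layer.
lexical = {
    ("Extend", "extend"): 1.0,
    ("Merge", "merge"): 1.0,
    ("Hold", "hold"): 1.0,
    ("Release", "withdraw"): 0.8,
}
binary = {
    ("Handshake", "Reach", "Release"): 1.0,
    ("Reach", "Extend", "Merge"): 1.0,
    ("Release", "Hold", "Release"): 0.2,  # the hands may be held before withdrawing
}
p = best_parse_prob(["extend", "merge", "hold", "withdraw"], lexical, binary, "Handshake")
```

Because the input is a single string of symbols, two sub-activities that overlap in time cannot both be represented, which illustrates the limitation for concurrent actions noted above.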

Recently, efforts have been made towards a new hierarchical framework.
In [81], a four-level hierarchy is proposed. Actions are represented by a
set of grammar rules categorized into three classes (strong, weak, and
stochastic relations) based on spatio-temporal relations.

4.3. description-based approaches

Description-based approaches differ from statistical and syntactic
approaches in their capability to explicitly express the spatio-temporal
structures of human activities. Such methods are therefore able to
recognize both sequential and concurrent actions instead of being limited
to sequential actions. Basically, description-based approaches model a
human activity as an occurrence of embedded sub-activities. Such
occurrences must satisfy specified temporal, spatial, and logical
relationships that are characteristic of a high-level activity. Since their
introduction, Allen’s temporal interval predicates have been adopted in
description-based human activity recognition to express both sequential
and concurrent relationships. Context-free grammars have also been used in
description-based approaches, where a formal syntax is required for the
representation of human activities, as in [57, 66]. In [60], conversion
from an Allen’s interval algebra constraint network to a PNF-network is
proposed to describe identical temporal information in a form that is
computationally tractable. Bayesian belief networks and Petri nets are
introduced in [31] and [17], respectively. Event logic is described by
Siskind [73] for recognizing high-level activities. Because
description-based approaches are deterministic, they are sensitive to
failures of their low-level components; to compensate, several
probabilistic extensions of the recognition frameworks are proposed in
[2, 22]. Markov logic networks (MLNs), a symbolic artificial intelligence
technique, have also been adopted to infer activities of interest
probabilistically, as in [76].

Table 4: Comparison of hierarchical approaches

Approach          Category            KTH     WZMN   Other
Yin’10            Statistical         82%     -      -
Zeng’10           Statistical         92.1%   100%   -
Han’10            Statistical         -       -      CMU:98.27%
Wang’11           Syntactic           92.5%   -      HOHA:37.6%, UCF:68.3%
Ijsselmuiden’10   Description-based   -       -      GroupActivities:74.4%
Morariu’09        Description-based   -       -      Basketball:72%

Ijsselmuiden and Stiefelhagen [28] provide a brief framework for high-level 
human activity recognition. It combines different input sources and is based 
on temporal logic. No probabilistic computation is employed in this work. 

Recently, a framework was proposed in [55] to recognize behavior in
one-on-one basketball by means of arbitrary trajectories obtained by
tracking the ball, hands, and feet. The framework uses video analysis and
mixed probabilistic and logical inference to annotate events. The method
requires semantic descriptions of what generally happens in various
scenarios. First-order logic based on Allen’s interval logic is used to
encode knowledge of spatio-temporal structure, and an MLN is used to
handle uncertainty in low-level observations.
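
Allen's interval relations, which description-based methods use to state temporal constraints between sub-activities, can be sketched as follows. The function names, the (start, end) tuple representation, and the handshake rule are hypothetical and serve only as an illustration; they do not reproduce the rule sets of [55] or [28].

```python
def allen_relation(a, b):
    """Return Allen's relation of interval a with respect to interval b.

    Intervals are (start, end) pairs with start < end.
    """
    (a0, a1), (b0, b1) = a, b
    if a0 >= a1 or b0 >= b1:
        raise ValueError("intervals must satisfy start < end")
    if a1 < b0:
        return "before"
    if a1 == b0:
        return "meets"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    if a0 < b0 < a1 < b1:
        return "overlaps"
    # The three remaining cases are inverses of relations handled above.
    inverse = {"before": "after", "meets": "met-by", "overlaps": "overlapped-by"}
    return inverse[allen_relation(b, a)]


def is_handshake(extend, merge, withdraw):
    """Hypothetical rule: extending precedes merging, which precedes withdrawing."""
    sequential = {"before", "meets"}
    return (allen_relation(extend, merge) in sequential
            and allen_relation(merge, withdraw) in sequential)
```

Relations such as "during" and "overlaps" describe sub-activities that co-occur, which is how a description-based rule can express concurrency that a string of symbols cannot.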

Despite the effort described above, no common standard dataset has been
widely adopted, so description-based approaches can be compared only in
terms of functionality rather than statistically. A comparison of
hierarchical approaches is shown in Table 4.


5. Conclusion 

In this letter, we provide a survey of advances in automated human
action recognition. A large collection of methods is identified. Among
them, 50 specific and influential proposals from the last three years are
reported. The discussion uses the same taxonomy as a previous survey, based
on whether the action is recognized directly from the images or from
low-level sub-actions. Our goal was to cover the state-of-the-art
developments in each category, together with the datasets used in
validation.

The literature reviewed shows that much research has been devoted to
recognizing human actions directly from videos or images in a
single-layered manner. This is especially true of methods using space-time
volumes and local features. It is natural to extend 2D image processing
methods, such as interest point detection, to 3D videos to extract feature
descriptors. Meanwhile, more and more researchers are beginning to explore
methods for high-level activity recognition. In this case, most methods
surveyed use a hierarchical approach, based on statistical, syntactic, or
description-based methods, to explain and infer activities from low-level
events. In particular, it is of interest to combine formal descriptors
with probabilistic reasoning to interpret human actions, as done in
[73, 57, 66].

While some research has focused on complex real-world actions, the most
popular test datasets still feature simple, constrained, and structured
environments. For example, the actions observed in the KTH and Weizmann
datasets are simple, and most algorithms achieve high accuracy in
recognizing them. More realistic datasets, such as Hollywood movies and
Youtube videos, are challenging, and the accuracy reported on them in the
literature surveyed here is low. Building on the results for low-level
actions, we hope more research will be done on high-level action
recognition, both on datasets and in real-world scenes.

We know, however, that a complete review of all approaches is beyond
reach. As a popular research topic, human action and activity recognition
has attracted much attention and will remain important. As more and more
application fields are explored, domain-specific techniques will probably
emerge on the one hand; on the other hand, a cross-domain framework would
be beneficial to the entire community.